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Abstract 

Background: It is a great challenge of modern biology to determine the functional roles of non-synonymous 
Single Nucleotide Polymorphisms (nsSNPs) on complex phenotypes. Statistical and machine learning techniques 
establish correlations between genotype and phenotype, but may fail to infer the biologically relevant 
mechanisms. The emerging paradigm of Network-based Association Studies aims to address this problem of 
statistical analysis. However, a mechanistic understanding of how individual molecular components work together 
in a system requires knowledge of molecular structures, and their interactions. 

Results: To address the challenge of understanding the genetic, molecular, and cellular basis of complex 
phenotypes, we have, for the first time, developed a structural systems biology approach for genome-wide 
multiscale modeling of nsSNPs - from the atomic details of molecular interactions to the emergent properties of 
biological networks. We apply our approach to determine the functional roles of nsSNPs associated with hypoxia 
tolerance in Drosophilo melonogoster. The integrated view of the functional roles of nsSNP at both molecular and 
network levels allows us to identify driver mutations and their interactions (epistasis) in H, RadSID, Ulpl, WntS, 
HDAC4, Sol, Dys, GalNAc-T2, and CG33714 genes, all of which are involved in the up-regulation of Notch and 
Gurken/EGFR signaling pathways. Moreover, we find that a large fraction of the driver mutations are neither 
located in conserved functional sites, nor responsible for structural stability, but rather regulate protein activity 
through allosteric transitions, protein-protein interactions, or protein-nucleic acid interactions. This finding should 
impact future Genome-Wide Association Studies. 

Conclusions: Our studies demonstrate that the consolidation of statistical, structural, and network views of 
biomolecules and their interactions can provide new insight into the functional role of nsSNPs in Genome-Wide 
Association Studies, in a way that neither the knowledge of molecular structures nor biological networks alone 
could achieve. Thus, multiscale modeling of nsSNPs may prove to be a powerful tool for establishing the functional 
roles of sequence variants in a wide array of applications. 
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Background 

Recent advances in next generation sequencing have gen- 
erated abundant genetic variants and "omics" data. 
Together, these extremely large, multidimensional data- 
sets present an exciting opportunity to identify genes, 
and to predict pathways likely to be involved in diseases 
and traits. However, these complex data sources plus the 
broad spectrum of phenotypes, challenge the quest to 
uncover the genetic, molecular, and cellular mechanisms 
that underlie phenotypes [1-3]. A major challenge in 
deciphering the genetic basis of multigenic diseases or 
traits is to distinguish driver mutations that impact the 
survival or reproduction of a particular phenotype (e.g., 
cancer) from passengers that do not confer a selective 
advantage. Standard genome sequence analysis cannot 
detect all driver mutations due to difficulties in the esti- 
mation of the background mutation rate and underlying 
genetic heterogeneity of adaptive phenotypes [4,5]. Statis- 
tical machine learning techniques (e.g., SNAP [6]) pro- 
vide an alternate approach by learning from the 
annotated mutation data. However, the "black-box" nat- 
ure of machine learning makes it difficult to interpret the 
novel functional roles of mutations. Parallel to the devel- 
opment of new genotyping and phenotyping techniques, 
a number of novel computational tools have been devel- 
oped to integrate and analyze genetic and omics data 
with the aim of establishing statistical causal relationships 
between genetic markers, genome-wide molecular signa- 
tures, and organismal phenotypes [7-13]. For example, 
co-expression and Bayesian network models derived 
from DNA variances and genome-wide transcriptional 
profiles have been applied to identify causal disease genes 
[14], cancer drivers [10,15], and master regulators of can- 
cer [16-18]. Although great efforts have been made to 
address n«p problem, where the number of observations 
n (e.g., gene expressions in different conditions) is much 
smaller than the number of variables or parameters 
p (e.g., all measured genes), the power of these statistics- 
based techniques is still limited if sample sizes are small. 
Moreover, the complex phenotype is often associated 
with interactions among multiple causal genes (epistasis), 
any of which alone is not sufficient to drive phenotypic 
change. It is challenging for statistical methods to identify 
epistasis given the large number of possible interactions. 
Fundamentally, the "causal" relationships inferred from 
these methods are mathematical correlations. They may 
not provide biological insight into the underlying mole- 
cular and cellular mechanisms that associate genotypes 
with phenotypes. 

A mechanistic understanding of how individual molecular 
components work together in a system, and how the system 
is affected and adapted to individual changes, requires 
knowledge of molecular structures, their interactions, and 



their conformational dynamics [19]. Conversely, a priori 
knowledge of structures, their interactions and dynamics 
may facilitate the identification of causal mutations and 
their interactions from noisy data even where statistical 
techniques fail. In this paper, we have developed an inte- 
grated multiscale modeling framework to decipher the 
impact of non-synonymous Single Nucleotide Polymorph- 
isms (nsSNPs) on the information flow from the activity of 
a single molecular component, to the function of the com- 
plete molecular machinery, and ultimately to the emergent 
properties of the biological network. Conceptually, our 
approach is rooted in Crick's central dogma of molecular 
biology and Blois's scalar theory of biomedical information 
[20]. The fundamental concept of scalar theory is that com- 
plex phenotypes arise from the emergent properties of 
lower scales in the hierarchy which themselves have an 
intermediate phenotype (or mesophenotype). Based on sca- 
lar theory, an organismal phenotype (e.g., disease) emerges 
from dysregulated pathways that can be identified by gen- 
ome-wide signatures such as gene expression profiles. The 
change of the genome-wide signature between disease and 
normal states results from the altered molecular machinery 
in the cell, which includes abnormal molecular interactions. 
In turn, the molecular interaction is determined predomi- 
nately by the shape, dynamics and physiochemical proper- 
ties of the associated biomolecules - properties changed by 
genetic modifications. From an algorithmic point of view, 
the task is to predict the response of the mesophenotype to 
the emergent properties of the lower scale, and then use 
that prediction as input to the upper scale. This is different 
from current paradigms that often bypass one or more 
intermediate phenotypes. In practice, each level can be stu- 
died independently and then integrated for an improved 
outcome (Figure 1). In this paper, our contributions are 
three-fold. First, we address the challenges of identifying 
causal mutations and epistasis in Genome- Wide Associa- 
tion Studies (GWAS) data when the sample size is extre- 
mely small. We do so by incorporating a priori knowledge 
of protein structure, evolution and interaction, and cellular 
signaling and regulatory pathways. In principle, it allows us 
to identify driver mutations de novo. Second, we show evi- 
dence that a large fraction of driver mutations may be 
involved in perturbation of protein-protein interaction and 
protein-nucleic acid interactions, and alternation of molecu- 
lar allosteric regulation; molecular mechanisms that have 
been paid too little attention in GWAS thus far. Third, we 
introduce a new method to identify mutation mediated 
pathway profiles, which can be used to prioritize driver 
mutations and epistasis, by integrating sequence variances, 
protein-protein interaction networks, and gene expression 
profiles. 

To demonstrate the feasibility of our approach, we apply 
multiscale modeling to reveal the genetic, molecular, and 
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Figure 1 A multiscale modeling strategy to integrate statistical machine learning, protein structural analysis, and biological network 
analysis. 



cellular basis of hypoxia, a physiological condition in 
which the cell is deprived of an adequate oxygen supply. 
The hypoxia- induced phenotype has been related to multi- 
ple pathological conditions including cancer [21]. Cells, 
tissues, and organisms have developed different strategies 
to survive low oxygen levels; however, the underlying 
molecular mechanisms contributing to hypoxia tolerance 
remain unclear. To render mammalian cells and tissues 
resistant to a low 02 environment, Drosophila melanoga- 
ster {D. melanogaster) has been used as a model system to 
investigate the mechanisms underlying hypoxia tolerance. 
Through long-term laboratory selection, Zhou et al. have 
generated D, melanogaster populations that tolerate 
severe, normally lethal, levels of hypoxia [22] . Microarray 
analysis identified several adaptive changes in the hypoxia- 
selected flies [22]. Comparison between the genome 
sequences of hypoxia-selected flies and those of controls 
identified 107 amino acid mutations in 52 genes [23]. 
These data provide us with an unparalleled opportunity to 
understand the genetic, molecular, and cellular basis of 
the hypoxia tolerance phenotype and to develop new com- 
putational tools to establish causal genotype-phenotype 
associations, which can be validated through controlled 
experiments. It is noted that the gene expression profiles 
are only measured for one condition in the hypoxia 



tolerance phenotype, hence conventional co-expression 
approaches are not applicable to this study. Although the 
hypotheses generated from this study have been experi- 
mentally validated by us and are consistent with experi- 
mental results from others, the sensitivity and specificity 
of the method has not been fully evaluated. In the future 
we will extensively test our method using large case- con- 
trol datasets from public databases such as the NCBI data- 
base of genotypes and phenotypes (dbGap) [24] and the 
Welcome Trust Case Control Consortium ( WTCCC) [25] . 

Results 

Knowledge-driven network inference of driver mutations 
responsible for hypoxia tolerance 

Complex phenotypic changes typically arise from re-regu- 
lated cellular signaling and regulatory pathways (core path- 
ways). As there are often multiple genes involved in a core 
pathway, a large number of combinations of genetic altera- 
tions can lead to the up- or down-regulation of a pathway. 
Our hypothesis is that driver mutations will collectively 
contribute to the re-regulation of a core pathway, which 
manifests itself as a change to the genome-wide signature, 
measured here by differentially expressed genes between 
hypoxia and normoxia phenotypes. In this way it is possible 
to identify the pathway involved in genotype-phenotype 
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associations from the interacting gene networks that con- 
nect the mutated genes to the differentially expressed 
genes. The pathway that appears more frequently than by 
chance is a potential core pathway. If the hypothesized 
core pathway is validated by experiments, or consistent 
with prior knowledge, the association with the core path- 
way can prioritize the driver mutation. Based on this ratio- 
nale, we developed a knowledge-driven network analysis 
method (Figure 2). First, both mutated genes and differen- 
tially expressed genes are labeled in the protein-protein 
interaction (PPI) network. Second, a mutation seeded sub- 
network (MSSN) that connects the mutated gene (seed) 
and the up- or down- regulated genes (targets) is then iden- 
tified for each of the mutated genes if the length of the 
path between the mutated gene and the re-regulated gene 
is shorter than randomly selected paths. Intuitively, the 
mutated gene will have a bigger impact on the differentially 
expressed genes if the distance between them is shorter. 
Third, the overrepresented biological pathways in the 
MSSN are identified using BiNGO, a tool for Gene Ontol- 
ogy Over-representation Analysis. The most frequently 
overrepresented biological pathways for the complete 
MSSN are hypothesized to be core pathways, and validated 
by experiment. Finally, the putative driver mutations are 
ranked by: (1) the statistically significant shortest distance 
between the mutated gene and the differentially expressed 
genes in the MSSN, and (2) the statically significant 
enriched core pathways. This pathway analysis of the 
MSSN identifies four core pathways: up-regulated Notch 
and Gurken/Epidermal Growth Factor Receptor (EGFR), 
and down-regulated Toll and Torso/Receptor Tyrosine 
Kinase (RTK) pathways. Using Notch signaling inhibition 
and a P-element screen, we have experimentally validated 
that the up-regulation of Notch signaling is critical to the 
survival of hypoxia tolerant Drosophila strains [23,26]. 



Thus the up-regulation of Notch signaling is confirmed as 
a driver for hypoxia tolerance in D. melanogaster. 
Although more experiments are needed to validate the 
direct association of other pathways with hypoxia toler- 
ance and their potential cross-talk with Notch signaling, 
the mutation could be a driver if it up-regulates Notch sig- 
naling. As shown in Table 1, nine MSSNs show statisti- 
cally significant enrichment (FDR corrected p-value < 
0.05) for up-regulation of Notch signaling pathways and 
significantly shorter paths between the mutated gene and 
differentially expressed genes. An immediate question is, 
what are the underlying molecular mechanisms associated 
with these putative driver mutations? If these mutations 
are non-neutral at the molecular level, it provides addi- 
tional support for our hypothesis. 

Structural analysis of functional roles of nsSNPs 
Structural modeling of nsSNPs 

To better understand the molecular basis of potential 
driver mutations, we mapped point mutations to protein 
structure models. Among the 52 proteins containing 
nsSNPs, none of them have known structures available 
in the RCSB Protein Data Bank (PDB) [27]. Homology 
models were built for these proteins. The distribution of 
the sequence identities associated with the structural 
templates used in modeling is shown in Figure 3. 60% of 
models are based on a template with a sequence identity 
greater than 30%, a common threshold for building reli- 
able homology models. These protein models can be 
grouped into four categories: 1) Reliable models can be 
built and the locations of mutations are close to known 
functional sites. Thus, the functional role of nsSNPs can 
be predicted in a relatively straightforward manner. 
These proteins (6 total) are listed in Additional File 1 
table SI and model structures of these proteins are 
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Figure 2 Workflow to determine core pathways and driver mutations. (A) mutated genes (blue filled circle) and differentially expressed 
genes (green filled circle) are mapped to a protein-protein interaction network in which the circles and lines represent proteins and interactions 
between them, respectively. (B) The shortest-path algorithm is applied to construct subnetworks by linking the mutated genes to up-, or down- 
regulated genes, respectively. (C) BiNGO is applied to identify overrepresented biological pathways. (D) The experimentally validated driver 
pathway is used to rank the driver mutations. 



Table 1 Predicted driver mutations and core pathways for hypoxia tolerance in Drosophila melanogaster from multiple evidences. 



Mutated Gene Molecular Function FDR Corrected p-value for the Shortest-path Functional role of nsSNP Expected accuracy (%) of Human ortholog 

(Annotation overrepresentation of signaling Distance (z-score) inferred from structural non-neutral mutation from and hypoxia 

Symbol) pathways up/down modeling SNAP [6] association 



Up-regulation Down-regulation 







Notch* 


Gurken/ 
EGFR 


Toll 


Torso/RTK 










Hairless (CG5460) 


transcription 
CO re pressor 


1.01e-5 


2.50e-3 


8.23e-3 


5.76e-5 


2.44/4.42 


Possible DNA binding 


82 


Yes [31] 


RadSID (CG6318) 


DNA-dependent 
ATPase 


3.36e-2 


1.20e-2 


1.95e-2 


1.42e-3 


2.54/4.09 


PPI 


<50 


Yes [30] 


Ulpl (CGI 2359) 


SUMO-specific 
protease 


4.68e-2 


1.87e-2 


>0.05 


>0.05 


1.78/3.86 


unknown 


63 


Yes [43] 


Wnt5 (CG6407) 


receptor binding 


2.67e-2 


1.97e-3 


1.06e-2 


1.16e-7 


1.26/3.41 


unknown 


58 


Yes [36-42] 


HDAC4 (CGI 770) 


histone deacetylase 4 


2.70e-2 


4.10e-4 


>0.05 


5.76e-5 


1.11/3.13 


AR of catalytic activity 


<50 


Yes [34] 


Sol (CGI 391) 


calcium-dependent 

cysteine-type 

endopeptidase 


1.51e-2 


1.69e-2 


1.55e-2 


2.11 e-3 


0.33/2.82 


unknown 


<50 


unknown 


Dys (CG34157) 


Dystrophin 


8.28e-5 


8.05e-5 


>0.05 


3.17e-3 


0.38/0.72 


AR of substrate binding 


70 


Yes [32,33] 


GalNAc-T2 
(CG6394) 


N-acetylgalactosaminyl 
transferase 


2.59e-3 


2.71 e-10 


1.30e-2 


3.28e-3 


-1.51/0.81 


AR of substrate binding 


<50 


Yes [35] 


CG33714 
(CG33714) 


mRNA binding 


8.59e-5 


5.61 e-3 


1.51e-2 


>0.05 


-1.59/0.69 


mRNA binding 


87 


unknown 



^We experimentally validate that the up-regulation of Notch signaling, one of the most frequently overre presented pathways, confers the hypoxia tolerance in Drosophila melanogaster [23, 26]. PPI: Protein-Protein 
Interaction, AR: Allosteric Regulation. 
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Figure 3 Distribution of sequence identities between proteins containing nsSNPs and templates in the RCSB PDB. 



shown in Additional File 1 Figure SI. 2) Reliable models 
can be built but the point mutation cannot be mapped 
to any functional sites. This category includes 11 pro- 
teins (Table S2). 3) No structural templates can be 
found for the domains containing the mutation, but 
structural models can be built for other functional 
domains of the same protein. 21 proteins are in this 
category (Table S3). 4) No structural templates can be 
found for the whole protein or any part; 15 proteins are 
in this category. 

Structural roles of putative driver mutations 

For putative mutations in nine genes predicted from net- 
work analysis (Table 1), we first analyze if the mutation 
might alter substrate binding, catalytic activity, or struc- 
tural stability. Interestingly, most of the predicted driver 
mutations are surface-exposed, but not located in con- 
served functional sites. We hypothesize that they may be 
involved in allosteric regulation, protein-protein interac- 
tions, or protein-nucleic acid recognition. Co-evolution 
analysis is applied to these proteins to identify the corre- 
lation between mutated amino acids and functional sites. 
The residue couplings were observed in four structures 
(HDAC4, Dys, GalNAc-T2, and CG33714). One example 
is HDAC4, which belongs to the histone deacetylase 
family. As shown in Figure 4, residues that are predicted 
to be co-evolved with A1075, one of the mutations in 
HDAC4, form zinc binding sites. Among them, the two 
His residues around the zinc ion are conserved in all 



members of the class Ila histone deacetylase family. The 
mutation of residues coordinating the zinc ion was 
reported to prevent the association of HDAC4 with the 
N-CoR- HDAC3 repressor complex [28], which is 
required for HDAC4 to possess histone deacetylase activ- 
ity [29]. Thus, A1075 is functionally coupled to the zinc 
binding site in HDAC4, and as a consequence, may 
remotely regulate its activity. More examples are shown 
in the Additional File 1 Figures S2-S4. 

In addition to allosteric regulation, the putative driver 
mutation may modify protein-protein interactions. This 
is the case for RadSlD, as shown in Figure 5. RadSlD 
plays a major role in homologous recombination repair 
(HRR) of damaged DNA arising during replication or 
induced by DNA damaging agents. BRC repeat (BRCA2 
in Figure 5) mimics a motif in RadSlD that serves as an 
interface for oligomerization between individual Rad51D 
monomers. One of the RadSlD mutations, S55N, is close 
to the oligomerization interface between individual 
RadSlD monomers and hence may impact the formation 
of the RadSlD complex, which is associated with the 
hypoxia phenotype [30]. 

Machine learning based prediction of non-neutral nsSNPs 

The functional importance of nsSNP is further supported 
by SNAP [6], software used to predict a given nsSNP as 
neutral or non-neutral with an expected accuracy. In a 
benchmark study, SNAP outperformed most similar 



Xie et al. BMC Genomics 2013, 14(Suppl 3):S9 
http://www.biomedcentral.eom/1 471 -21 64/1 4/S3/S9 



Page 7 of 1 2 





Figure 4 The model structure of HDAC4. The yellow spheres represent mutated amino acid A1075. Green sticks represent residues co-evolved 
with A1075. Red circle represents zinc binding sites of HDAC4. This model structure is built using Modeller [65] based on the sequence 
alignment between HDAC4 and PDB structure 2VQW. The sequence identity between HDAC4 (819-1223) and 2VQW is 58%. 



methods [6]. 23 out of the 107 nsSNPs, located on 18 
genes, are predicted as non-neutral with an accuracy of 
higher than 58% (SNAP reliability index 0), (Additional 
File 1 Table S4). Five predicted non-neutral mutations 
are hypothesized as putative drivers. Two of them (H and 
CG33714) have an accuracy of over 80%. The remaining 
predictions have lower expected accuracies. This could 
imply that while the functional impact of each individual 
mutation is limited, collectively they may mediate the sig- 
naling pathway activity through epistasis. 

Several mutations in CG31220 (Additional File 1 
Table S4), a serine- type peptidase, are predicted as non- 
neutral by SNAP. These mutations are mapped to the 



substrate binding sites or other functional important 
regions in the structure (Additional File 1 Figure SI). 
However, enriched biological pathways associated with 
this gene were not detected. More studies are required 
to understand how these non-neutral mutations impact 
the biological network. 

Experimental and literature supports 

As discussed above, a complex phenotype rises from re- 
regulated biological pathways that themselves result 
from the collective effects of multiple genetic mutations 
(epistasis). Since the down- or up-regulation of core 
pathways directly impacts the organismal phenotype, the 
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Figure 5 Model structure for RadSID. The green and cyan cartoon represents the model structure of RadSlD and BRCA2, respectively. Dotted 
spheres represent the mutated amino acids. The red circle indicates the region of the protein-protein interaction interface. Ser55 mutation may 
directly impact the oligomerization of RadSlD. 



experimental validation of the core pathway would pro- 
vide strong evidence to support the predicted driver 
mutations that are responsible for the re-regulation of 
the core pathway. Indeed, we have experimentally vali- 
dated that Notch signaling is the core pathway of 
hypoxia tolerance in D. melanogaster. The reduced acti- 
vation of Notch signaling by a specific y-secretase inhi- 
bitor significantly reduces the survival and life-span of 
hypoxia tolerant D, melanogaster strains [23] . The criti- 
cal role of Notch signaling in hypoxia tolerance is 
further supported by UAS-Gal4 over-expression and 
RNAi knockdown of genes involved in Notch signaling 
[26]. Other experimental evidence from the literatures, 
as detailed below, also support our predictions. The top 
ranked H gene (also called hairless) is a well-known reg- 
ulator of Notch signaling in D, melanogaster [31]. Dys 
encodes the protein dystrophin. Genetic interaction 
screens in D, melanogaster have shown that Dys is 
involved in interactions with components of the Notch 
signaling pathway [32]. Furthermore, the mutation of 
the Dys homolog in the mouse model is related to the 
up-regulation of the Notch-beta pathway [33]. For other 
genes, although little direct experimental evidence sup- 
ports an association with hypoxia in D, melanogasta 



their functional roles in hypoxia has been demonstrated 
in cancer and other human diseases. HDAC4 regulates 
hypoxia-inducible factor 1 a (HIFl a) and cancer cell 
response to hypoxia [34]. GalNAc-T2 is an N-acetyl- 
galactoseaminyl transferase that catalyzes the synthesis 
of glycosphingolipid (GSL). A recent study has shown 
that GSL may directly regulate the activity of Notch sig- 
naling [35]. WntS is a ligand to a family of frizzled 
receptors, acting as a regulator of Wnt signaling. An 
increasing body of evidences suggests that Wnt and 
Notch signaling cooperatively determine the fate of cell 
development in humans [36-42]. The association 
between RadSlD and hypoxia has been demonstrated in 
cancer [30]. Ulpl is a SUMO-specific protease that is 
essential for the stabilization of HIFla during hypoxia 
by removing SUMO and participates in the regulation 
of hypoxia-responsive genes [43] . 

Discussion 

The important functional role of allosteric regulation, 
protein-protein interactions, and protein-nucleic acid 
interactions in sequence variants 

In this study, none of the driver mutations associated 
with hypoxia are conserved functional site residues, nor 
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are they responsible for structural stability. The driver 
mutations are hypothesized to be involved in either pro- 
tein-protein interactions (in the case of RadSlD), pro- 
tein-nucleic acid interaction (e.g., in CG33714), or 
allosteric regulation (e.g., in HDAC4). A recent survey 
of the structural basis of in-frame mutations in protein- 
protein interactions has suggested that changes in speci- 
fic interactions play a critical role in pathogenesis [44]. 
From a network point of view, the modification of pro- 
tein-protein interactions, rather than the proteins them- 
selves, may have significant impact on network 
properties [45]. Recent progress in the ENCODE and 
modENCODE projects highlights the critical functional 
roles of non-coding DNAs in the regulation of biological 
processes [46,47]. As a large number of non-coding 
DNAs perform their functions through specific protein- 
nucleic acid interactions, the mutations that impact pro- 
tein-nucleic acid binding could be directly associated 
with phenotype changes. The dysregulation of allosteric 
interactions is considered to be another major determi- 
nant of disease [48]. During evolution, organisms need 
to survive and reproduce in a changed environment. As 
such, certain genes need to gain functions and activate 
critical pathways. Allosteric regulation is an efficient 
way for driver mutations to act since the change of 
activity is not constrained to a single molecule, but can 
be propagated to a whole network [19]. New computa- 
tional methods that are able to identify "hot spots" in 
protein-protein interactions, protein-nucleic acid recog- 
nition, and allosteric regulations, in which the mutation 
may cause the dysregulation of biological pathways, may 
have significant impact on the interpretation of Gen- 
ome-Wide Association Studies. 

The relevance of D. melanogasta driver mutations to 
liuman hypoxia adaption 

Recently several studies in hypoxia adaptation in humans 
have been performed on Tibetans [49,50], Andeans [50], 
and Ethiopians [51]. However, all human studies to date 
have adopted limited, sampling-based approaches, such as 
genotyping or exome sequencing. The relatively sparse 
sampling of the genome makes it harder to identify large- 
scale shifts in the allele frequency spectrum associated 
with natural selection. Consequently, these studies 
restricted subsequent analysis to variants in candidate 
genes that are mainly involved in the canonical hypoxia 
response (HIE pathway) and related pathways. The identi- 
fication of the functional roles of sequence variances in 
human orthologs of Drosophila genes may provide critical 
insight in the prioritization of candidate genes in human, 
which may fail using conventional statistical techniques. 
Indeed, the majority of driver mutations identified in this 
study are human orthologs and associated with the 
hypoxia cellular phenotype, as shown in Table 1. 



Conclusion 

Based upon multiscale modeling, we propose that the 
up-regulation of Notch and Gurken/EGER and the down- 
regulation of Toll and Torso/RTK pathways are responsi- 
ble for hypoxia tolerance. Using integrated structural and 
network analysis, we hypothesize that nsSNPs in H, 
Rad51D, Ulpl, Sol, Wnt5, CG33714, GalNAc-T2, Dys, 
and HDAC4, may all lead to the functional modification 
of these genes via allosteric regulation and protein-pro- 
tein/DNA/RNA interactions and hence are driver muta- 
tions defining the hypoxia tolerance phenotype. Our 
predictions are supported by experimental evidence 
[23,26]. Moreover, multiscale modeling may identify 
potential epistasis using a very small sample size. This 
reduces the burden imposed during statistical multiple 
testing of large epistasis models. It is anticipated that the 
further extension of this multiscale modeUng approach to 
genome-wide protein-protein interactions, protein-nucleic 
acid interactions, and microRNA data will provide a 
powerful tool for uncovering the functional roles of both 
coding and non-coding sequence variations in GWAS; a 
role which neither the knowledge of molecular structures 
nor of biological networks alone can achieve. However, 
challenges remain in extending multiscale modeling 
approaches. New algorithms are required to predict emer- 
gent properties, at both molecular and network levels, as 
well as to seamlessly model information flow across scales. 

Methods 

Prediction of non-neutral mutations on nsSNPs from 
sequence 

A sequence information based method, SNAP [6] is 
used to predict the non-neutral (functional effect) and 
neutral (no functional effect) nsSNPs. 

Knowledge-driven network inference of core pathways 

and driver mutations 

Overview 

The network-based analysis of driver mutation is shown in 
Eigure 2. The mutated genes and differentially regulated 
genes are mapped to a protein-protein interaction (PPI) 
network extracted from the STRING Database [52] for D, 
melanogaster. A subnetwork that connects a mutated gene 
and up-, and down-regulated genes is identified using a 
shortest path search of the PPI network. The genes identi- 
fied in each subnetwork are subject to Gene Set Enrich- 
ment Analysis (GSEA). If the genes in the subnetwork are 
enriched by the essential biological processes/pathways, 
the mutated gene is a potential driver. 
Analysis of differential expressed genes 
A cDNA microarray analysis of 13,061 known or pre- 
dicted genes from the D, melanogaster genome is per- 
formed using the R package [53]. K-nearest neighbors 
[54] in the space of genes is used to impute missing 
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expression values. The LOWESS normalization method 
[55] is used to normalize the raw density data. P-value and 
fold change are calculated using the two-sided, two-class 
t-test [56]. A Bonferroni-Holm [57] false discovery rate 
(FDR) controlling procedure [58,59] is used to adjust the 
P-values. The genes are considered to be differentially 
expressed between the two samples when the FDR is smal- 
ler than 0.05. If the fold change is larger than 1.5-fold for 
up-regulated genes and is smaller than 0.67-fold for down- 
regulated genes, these genes are considered significantly 
differentially expressed. 

Subnetwork construction by shortest path search 

A program based on Dijkstra's algorithm [60] is developed 
to search for the shortest path from a source node 
(mutated gene) to a destination node (differentially 
expressed gene) in a protein-protein interaction (PPI) net- 
work. A shorter path implies that the mutated gene has 
stronger influence on the differentially expressed genes. All 
genes along the path form a subnetwork. In order to obtain 
a quantitative measurement to distinguish the different 
topologies in the subnetworks, a t-value based on Welch's 
t-test is calculated for each mutated gene. The Welch's t- 
test [61] calculates the difference of two populations whose 
variances are assumed to be different (unequal sample size 
and unequal variance). The t-value is calculated as follows: 

Y til nj 

Where is the unbiased estimator of the variance of the 
sample and n is the number of participants. 

Here the t-value is used to measure the difference 
between the identified subnetwork (X2) and a background 
random network (xi). Background random networks are 
built by randomly selecting one gene as a source node and 
a set of other genes as destination nodes. A positive 
t-value means a shorter than average path. The mutations 
on the genes with statistically significant high t-values are 
prioritized as driver mutations. 
Gene set overrepresentation analysis to identify driver 
biological pathways and mutations 

The Biological Networks Gene Ontology Tool (BiNGO) 
[62] is applied in Cytoscape's versatile visualization envir- 
onment [63] to determine which biological processes and 
molecular functions are significantly overrepresented in 
the set of genes involved in each subnetwork. Gene ontol- 
ogy [64] terms are ranked according to the False Discovery 
Rate (FDR) corrected p-values for each subnetwork. The 
statistically significant enriched biological pathways (p- 
value < 0.05) are considered as potential core pathways 
that contribute to the survival or reproduction of a pheno- 
type. This pathway is subject to further validations by 
experiments and literature searches. If a subnetwork 



contains the validated core pathway, the mutated gene in 
this subnetwork is hypothesized to be a causal gene. Cor- 
respondingly, the mutations on this gene are candidate 
driver mutations. 

Structure-based analysis of driver mutations 
Homology modeling and nsSNP mapping 

Homology models of proteins are built using Modeller 
[65]. Sequence alignments between these proteins and 
templates of known structures are obtained from a PSI- 
BLAST sequence search [66]. The functional sites are 
predicted using SMAP [67-69]. Mutated residues are 
mapped onto the model structures and the functional 
roles of these residues are predicted according to their 
locations on the model structures. 
Covariance analysis 

Covariance analysis based on multiple sequence align- 
ments of proteins in the same Pfam family [70] as the 
mutated protein can help identify remote relationships 
between mutated residues and other residues within the 
protein sequence. The Pfam family is identified by a whole 
sequence search. Redundancy of sequences in the Pfam 
family is removed using CD-hit [71] with a sequence iden- 
tity threshold of 90% [72]. Multiple sequence alignments 
among these sequences are built using the MUSCLE soft- 
ware [73] with default parameters. Covariance of muta- 
tions with other residues is calculated using five different 
methods: Statistical Coupling Analysis (SCA) [74]; Explicit 
Likelihood of Subset Co-variation (ELSC) [75]; Observed 
Minus Expected Squared covariance algorithm (OMES) 
[76]; Mutual Information Covariance Algorithm (MI) [77]; 
and Conservation Algorithm (Conservations um) [78]. The 
residues that are predicted to be coupling with mutations 
by at least two methods are considered as co-evolved resi- 
dues with the mutated residues. 
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The data sets supporting the results of this article are 
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